A Dynamic Programming Approach To Document Clustering Based On Term Sequence Alignment
Abstract
Document clustering is an unsupervised machine learning technique that, when provided with a large document corpus, automatically subdivides it into smaller, meaningful sub-collections called clusters. Currently, document clustering algorithms use sequences of words (terms) to compactly represent documents and define a similarity function based on those sequences. We believe that word sequence is vital in determining the contextual information of a document. A frequent sequence, maximal sequence, or multi-word sequence alone cannot give good contextual information. Such sequences can affect the local similarity computation and can compromise the accuracy of the final clusters. Motivated by this, we propose a dynamic programming approach to document clustering based on term sequence alignment. This research makes three main contributions: (i) a document representation model based on the sequence of terms used, (ii) a similarity measure that uses the term sequence alignment score to assign relatedness to a pair of documents, and (iii) a dynamic programming based hierarchical agglomerative clustering (HAC) algorithm to cluster the documents. Moreover, two closely related works, (a) Frequent Itemset-based Hierarchical Clustering (FIHC) and (b) text document clustering based on frequent word meaning sequences (CFWS), are extensively evaluated against the proposed algorithm on classical text mining datasets. The proposed algorithm significantly improves the quality of the clusters produced and is comparable to state-of-the-art text/document clustering algorithms.
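The abstract does not give the scoring scheme, the normalization, or the linkage criterion, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a Needleman-Wunsch style global alignment with unit match, mismatch, and gap scores, a similarity scaled by the longer sequence length, and average-link merging with a stopping threshold. The names `alignment_score`, `similarity`, and `hac` are hypothetical.

```python
from itertools import combinations


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two term sequences (scores are assumed values)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the alignment score into [0, 1] (this normalization is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, alignment_score(a, b) / longest)


def hac(docs, threshold=0.5):
    """Average-link agglomerative clustering over document indices.

    Merging stops once no pair of clusters reaches the threshold.
    """
    clusters = [[i] for i in range(len(docs))]
    sim = {
        (i, j): similarity(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def link(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        (x, y), best = max(
            (
                ((x, y), link(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    docs = [
        ["data", "mining", "text", "cluster"],
        ["data", "mining", "text", "clustering"],
        ["protein", "sequence", "alignment"],
        ["protein", "sequence", "alignment", "score"],
    ]
    print(hac(docs))
```

Each document is treated as an ordered list of terms, so two documents that share the same words in a different order score lower than two that share them in the same order. A bag-of-words representation would not capture that difference.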
Similar resources
An Application of the ABS LX Algorithm to Multiple Sequence Alignment
We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
Multiple sequence alignment using anchor points through generalized dynamic programming
A generalization of the dynamic programming algorithm applied to the multiple alignment of protein sequences is proposed. The algorithm has two main procedures: (i) local correspondences between sequences, hereafter called anchor points, are selected according to a criterion that combines local and global similarity values, (ii) the alignment is constructed recursively by choosing and linking to...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure
Author profiling is a text classification technique used to predict the profiles of unknown texts by analyzing their writing styles. Author profiles are characteristics of the authors such as gender, age, native language, country, and educational background. The existing approaches for author profiling suffer from problems like high dimensionality of features and fail to capture th...
Publication date: 2013